biplotEZ

User-friendly biplots in R



Centre for Multi-Dimensional Data Visualisation (MuViSU)
muvisu@sun.ac.za



SASA 2024

What is Correspondence Analysis?

  • Aims to expose the association between two categorical variables.

  • Categorical variables measure characteristics of individuals (samples) in the form of finite discrete response levels (category levels).

  • Summarised in a two-way contingency table.

  • Focus placed on nominal categorical variables - category levels with no specific rank / order.

  • Numerous variants of CA are available for diverse problems; the interested reader is referred to Gower, Lubbe, and le Roux (2011) and Beh and Lombardo (2014).

  • With biplotEZ, focus is placed on three EZ-to-use variants (more information to follow).

Correspondence Analysis

  • The data matrix for CA() differs from the one used with PCA() and CVA().

  • \(\mathbf{X}:r\times c\) represents the cross-tabulation of two categorical variables.

  • The elements of the data matrix represent the frequency of the co-occurrence of two particular levels of the two variables.

Consider the HairEyeColor data set in R, which summarises the hair and eye colour of male and female statistics students. For the purpose of this example only the female students (the second slice of the Sex dimension, HairEyeColor[,,2]) will be considered:

cross_tab <- HairEyeColor[,,2]
cross_tab
#        Eye
# Hair    Brown Blue Hazel Green
#   Black    36    9     5     2
#   Brown    66   34    29    14
#   Red      16    7     7     7
#   Blond     4   64     5     8
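
As a rough illustration of how such a table arises from two categorical variables, the sketch below expands the same slice back into one record per student and cross-tabulates it again with base R's table(). The object names (female, students, rebuilt) are made up for this example and are not part of biplotEZ.

```r
# Female slice of HairEyeColor (the third dimension is Sex: Male, Female)
female <- as.data.frame(as.table(HairEyeColor[, , "Female"]))

# One row per student: repeat each Hair/Eye combination Freq times
students <- female[rep(seq_len(nrow(female)), female$Freq), c("Hair", "Eye")]

# Cross-tabulating the two categorical variables recovers the contingency table
rebuilt <- table(Hair = students$Hair, Eye = students$Eye)
rebuilt
```

Each cell of rebuilt counts how many students share one hair colour level and one eye colour level, which is the co-occurrence frequency described above.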

Current functions for CA() in biplotEZ

  • biplot() |>
    • CA() |>
    • interpolate() |> fit.measures() |>
    • samples() |> newsamples() |>
    • legend.type() |>
  • plot()

Take note of the warning message:

biplot(HairEyeColor[,,2], center = TRUE) |> CA()
# Warning in CA.biplot(biplot(HairEyeColor[, , 2], center = TRUE)): Centering was
# not performed. Set biplot(center = FALSE) when performing CA().
# Object of class CA, based on 4 row levels and 4 column levels.

CA calculations

  • It is typical to express the frequencies in terms of proportions / probabilities.

  • Consider the correspondence matrix \(\mathbf{P}\):

P_mat <- cross_tab/sum(cross_tab)
P_mat
#        Eye
# Hair          Brown        Blue       Hazel       Green
#   Black 0.115015974 0.028753994 0.015974441 0.006389776
#   Brown 0.210862620 0.108626198 0.092651757 0.044728435
#   Red   0.051118211 0.022364217 0.022364217 0.022364217
#   Blond 0.012779553 0.204472843 0.015974441 0.025559105
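
A quick sanity check on the correspondence matrix, using only base R: prop.table() without a margin should give the same proportions, and the entries should sum to one. This is a small sketch, not something the package requires.

```r
cross_tab <- HairEyeColor[, , 2]
P_mat <- cross_tab / sum(cross_tab)

# prop.table() with no margin divides by the grand total
same_as_prop_table <- isTRUE(all.equal(P_mat, prop.table(cross_tab)))
same_as_prop_table

# The proportions of a correspondence matrix sum to one
total <- sum(P_mat)
total
```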

CA calculations

  • Row masses (diagonal matrix of the row sums of \(\mathbf{P}\)) - \(\mathbf{D_r}\), extracted from the biplotEZ object:
ca.out <- biplot(HairEyeColor[,,2], center = FALSE) |> CA()
ca.out$Dr
#           [,1]     [,2]      [,3]      [,4]
# [1,] 0.1661342 0.000000 0.0000000 0.0000000
# [2,] 0.0000000 0.456869 0.0000000 0.0000000
# [3,] 0.0000000 0.000000 0.1182109 0.0000000
# [4,] 0.0000000 0.000000 0.0000000 0.2587859
  • Column masses (diagonal matrix of the column sums of \(\mathbf{P}\)) - \(\mathbf{D_c}\):
ca.out <- biplot(HairEyeColor[,,2], center = FALSE) |> CA()
ca.out$Dc
#           [,1]      [,2]      [,3]       [,4]
# [1,] 0.3897764 0.0000000 0.0000000 0.00000000
# [2,] 0.0000000 0.3642173 0.0000000 0.00000000
# [3,] 0.0000000 0.0000000 0.1469649 0.00000000
# [4,] 0.0000000 0.0000000 0.0000000 0.09904153
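
The diagonal entries printed above agree with the row and column sums of the correspondence matrix. The following base R sketch rebuilds both matrices without biplotEZ; the names row_mass, col_mass, Dr and Dc are chosen for this example only.

```r
cross_tab <- HairEyeColor[, , 2]
P_mat <- cross_tab / sum(cross_tab)

# Margins of the correspondence matrix
row_mass <- rowSums(P_mat)   # one value per hair colour
col_mass <- colSums(P_mat)   # one value per eye colour

# Place the margins on the diagonal
Dr <- diag(row_mass)
Dc <- diag(col_mass)

round(diag(Dr), 7)
round(diag(Dc), 7)
```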

CA calculations

  • Consider the independence model:

\[\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(\text{Observed freq.}_{ij}-\text{Expected freq.}_{ij})^2}{\text{Expected freq.}_{ij}}\]

  • Standardised Pearson residuals (\(\mathbf{S}\)):

\[ \mathbf{S} = \mathbf{D_r}^{-\frac{1}{2}}(\mathbf{P}-\mathbf{rc'})\mathbf{D_c}^{-\frac{1}{2}}\]

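  • The deviations from independence are weighted by the inverse square roots of the row and column masses (\(\mathbf{D_r}^{-\frac{1}{2}}\) and \(\mathbf{D_c}^{-\frac{1}{2}}\)).

  • The expected proportions under independence are given by the product of the row and column masses (\(\mathbf{rc'}\)), where \(\mathbf{r}\) and \(\mathbf{c}\) hold the diagonal elements of \(\mathbf{D_r}\) and \(\mathbf{D_c}\).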

  • Biplot coordinates: singular value decomposition of \(\mathbf{S}\).

\[ \text{svd}(\mathbf{S}) = \mathbf{U\Lambda V'}\]
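
The calculation on this slide can be traced in base R. The sketch below forms \(\mathbf{S}\) from the formula above, takes its SVD, and checks that the sum of squared singular values (the total inertia) equals the Pearson chi-squared statistic divided by the sample size. Variable names are illustrative; this is not necessarily how CA() is implemented internally.

```r
N  <- HairEyeColor[, , 2]          # observed frequencies
n  <- sum(N)
P  <- N / n                        # correspondence matrix
r  <- rowSums(P)                   # row masses
cm <- colSums(P)                   # column masses
E  <- r %o% cm                     # expected proportions under independence, rc'

# Standardised Pearson residuals
S <- diag(1 / sqrt(r)) %*% (P - E) %*% diag(1 / sqrt(cm))

# Singular value decomposition S = U Lambda V'
dec <- svd(S)

# Total inertia versus the chi-squared statistic
total_inertia <- sum(dec$d^2)
chi2 <- sum((N - n * E)^2 / (n * E))
all.equal(total_inertia, chi2 / n)
```

Because the residuals sum to zero along the weighted rows and columns, the last singular value is numerically zero, so a 4 by 4 table has at most three informative dimensions.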

biplotEZ variants

  • Variant refers to how the singular values (\(\mathbf{\Lambda}\)) are allocated between the row and column coordinates in the biplot solution.

    • Row principal coordinate biplot (default):

    \[\begin{aligned} \text{Row coordinates: } \hspace{0.5 cm}&\mathbf{U\Lambda}\\ \text{Column coordinates: }\hspace{0.5 cm}& \mathbf{V}\end{aligned}\]

    • Row standard coordinate biplot:

      \[\begin{aligned} \text{Row coordinates: } \hspace{0.5 cm}&\mathbf{U}\\ \text{Column coordinates: }\hspace{0.5 cm}& \mathbf{V\Lambda}\end{aligned}\]

    • Symmetric Correspondence Analysis map:

      \[\begin{aligned} \text{Row coordinates: } \hspace{0.5 cm}&\mathbf{U\Lambda^{\frac{1}{2}}}\\ \text{Column coordinates: }\hspace{0.5 cm}& \mathbf{V\Lambda^{\frac{1}{2}}}\end{aligned}\]
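
To make the three allocations concrete, the sketch below computes two-dimensional coordinates from the SVD of \(\mathbf{S}\) exactly as the formulas above are written. It is a base R illustration under that literal reading; the coordinates stored by biplotEZ may differ in sign or by additional scaling (for example when lambda.scal = TRUE), and the list name coords is hypothetical.

```r
N  <- HairEyeColor[, , 2]
P  <- N / sum(N)
r  <- rowSums(P)
cm <- colSums(P)
S  <- diag(1 / sqrt(r)) %*% (P - r %o% cm) %*% diag(1 / sqrt(cm))

dec <- svd(S)
k <- 1:2                                   # first two dimensions
U <- dec$u[, k]
V <- dec$v[, k]
Lambda <- diag(dec$d[k], nrow = length(k))

coords <- list(
  Princ     = list(rows = U %*% Lambda,       cols = V),
  Stand     = list(rows = U,                  cols = V %*% Lambda),
  Symmetric = list(rows = U %*% sqrt(Lambda), cols = V %*% sqrt(Lambda))
)

# Inner products of row and column coordinates for each variant
inner <- lapply(coords, function(z) z$rows %*% t(z$cols))
```

All three variants give the same inner products, namely the rank-two approximation of \(\mathbf{S}\); they differ only in how distances among rows and among columns are displayed.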

CA function

CA()
Argument      Description
bp            Object of class biplot.
dim.biplot    Dimension of the biplot. Only values 1, 2 and 3 are accepted, with default 2.
e.vects       Which eigenvectors (principal components) to extract, with default 1:dim.biplot.
variant       Which correspondence analysis variant to use: "Princ" (default), "Stand" or "Symmetric".
lambda.scal   TRUE or FALSE: controls stretching or shrinking of column and row distances, with default FALSE.

Row principal coordinate biplot

biplot(HairEyeColor[,,2], center = FALSE) |> 
  CA() |> 
  plot()

Symmetric Correspondence Analysis map

biplot(HairEyeColor[,,2], center = FALSE) |> 
  CA(variant = "Symmetric") |> 
  legend.type(samples = TRUE) |> 
  plot()

Row standard coordinate biplot

biplot(HairEyeColor[,,2], center = FALSE) |> 
  CA(variant = "Stand") |> 
  plot()

Enhancing the visualisation

biplot(HairEyeColor[,,2], center = FALSE) |> 
  CA(variant = "Stand", lambda.scal = TRUE) |>
  samples(col=c("palevioletred1","purple4")) |> 
  plot()

New samples

biplot(HairEyeColor[,,2], center = FALSE) |> CA(variant = "Symmetric") |>  
  samples(pch = c(0,2)) |> 
  interpolate(newdata = HairEyeColor[,,1]) |> 
  newsamples(col = c("orange","purple"), pch = c(15,17)) |> 
  plot()    

Fit measures

ca.out <- biplot(HairEyeColor[,,2], center = FALSE) |> 
  CA(variant = "Symmetric") |> 
  fit.measures()
  • Quality:
ca.out$quality
# [1] 0.9833094
  • Adequacy:
ca.out$adequacy
#        Brown   Blue  Hazel  Green
# Dim 1 0.3824 0.5745 0.0425 0.0006
# Dim 2 0.5892 0.6277 0.3904 0.3926
# Dim 3 0.6102 0.6358 0.8530 0.9010
# Dim 4 1.0000 1.0000 1.0000 1.0000
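
For orientation, the sketch below computes the same kind of measures in base R under two assumptions that are not confirmed by the slides: that quality is the share of total inertia captured by the first two dimensions, and that adequacy in k dimensions is the diagonal of \(\mathbf{V}_k\mathbf{V}_k'\). Under the second assumption the adequacies in row k sum to k, which the printed table does up to rounding.

```r
N  <- HairEyeColor[, , 2]
P  <- N / sum(N)
r  <- rowSums(P)
cm <- colSums(P)
S  <- diag(1 / sqrt(r)) %*% (P - r %o% cm) %*% diag(1 / sqrt(cm))
dec <- svd(S)

# Assumed definition: proportion of inertia displayed in two dimensions
quality_2d <- sum(dec$d[1:2]^2) / sum(dec$d^2)
quality_2d

# Assumed definition: squared lengths of the rows of V in the first k dimensions
adequacy <- t(sapply(1:4, function(k) rowSums(dec$v[, 1:k, drop = FALSE]^2)))
dimnames(adequacy) <- list(paste("Dim", 1:4), colnames(N))
round(adequacy, 4)
```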

Fit measures

ca.out <- biplot(HairEyeColor[,,2], center = FALSE) |> 
  CA(variant = "Symmetric") |> 
  fit.measures()
  • Row predictivities:
ca.out$row.predictivities
#        Black  Brown    Red  Blond
# Dim 1 0.6860 0.8630 0.4219 0.9974
# Dim 2 0.9932 0.9441 0.8508 0.9999
# Dim 3 1.0000 1.0000 1.0000 1.0000
# Dim 4 1.0000 1.0000 1.0000 1.0000
  • Column predictivities:
ca.out$col.predictivities
#        Brown   Blue  Hazel  Green
# Dim 1 0.9439 0.9898 0.4789 0.0113
# Dim 2 0.9990 0.9997 0.9020 0.8177
# Dim 3 1.0000 1.0000 1.0000 1.0000
# Dim 4 1.0000 1.0000 1.0000 1.0000
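
Predictivities can be read as the fraction of each row's (or column's) squared length in \(\mathbf{S}\) that the k-dimensional approximation reproduces. The base R sketch below follows that definition, which is an assumption about what fit.measures() reports rather than a copy of its code; pred_k is a hypothetical helper.

```r
N  <- HairEyeColor[, , 2]
P  <- N / sum(N)
r  <- rowSums(P)
cm <- colSums(P)
S  <- diag(1 / sqrt(r)) %*% (P - r %o% cm) %*% diag(1 / sqrt(cm))
dec <- svd(S)

# Rank-k approximation of S, then the share of squared length it reproduces
pred_k <- function(k, by_row = TRUE) {
  S_k <- dec$u[, 1:k, drop = FALSE] %*%
    (dec$d[1:k] * t(dec$v[, 1:k, drop = FALSE]))
  if (by_row) rowSums(S_k^2) / rowSums(S^2) else colSums(S_k^2) / colSums(S^2)
}

row_pred <- t(sapply(1:3, pred_k, by_row = TRUE))
col_pred <- t(sapply(1:3, pred_k, by_row = FALSE))
dimnames(row_pred) <- list(paste("Dim", 1:3), rownames(N))
dimnames(col_pred) <- list(paste("Dim", 1:3), colnames(N))

round(row_pred, 4)
round(col_pred, 4)
```

Since \(\mathbf{S}\) has at most three non-zero singular values for this table, every predictivity reaches one by the third dimension.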

References

Beh, E. J., and R. Lombardo. 2014. Correspondence Analysis: Theory, Practice and New Strategies. Wiley.
Gower, J. C., S. Lubbe, and N. J. le Roux. 2011. Understanding Biplots. Wiley.